Week 10.3 - Agents: The Current Tool Landscape (Including MCP)

🎯 What We'll Cover

This is the practical sub-lesson: a tour of the agent tools that actually exist in May 2026, organised by what they do. We cover three families here — coding agents, computer-use and browser agents, and the connector layer (MCP) that links agents to your data. The fourth family, research agents (“Deep Research” modes), gets its own treatment in 10.4, and the full honest free-versus-paid tool comparison — including Chinese options — lands in 10.5.

Two commitments carry through. First, every capability number is time-stamped and will date; treat the trajectory as the lesson and the figures as a snapshot. Second, we keep asking the question that matters most for a UCT student: what can you actually use without a credit card? The answer is more encouraging than you might expect, but it requires knowing where to look.

And we carry forward the two ideas from 10.1 and 10.2: an agent is a model plus a harness, and an agent's reliability is not the same as its accuracy. Both shape how you should read every claim a tool vendor makes.

💻 Coding Agents

Coding agents are the most mature family, and — as we saw in 10.1 — increasingly the most capable general agents too, because a tool that can read a folder, run a command, and edit a document is a research assistant whether or not the documents are code. The leaders as of May 2026 are Claude Code (Anthropic), Codex (OpenAI), and Cursor, with Cline and opencode as prominent open alternatives.

The honest access picture: the most capable coding agents are paid. Claude Code, Codex, and Cursor all sit behind subscriptions, and through 2026 the pricing models shifted repeatedly as the companies worked out how to charge for long-running, compute-hungry agents. For a student without a subscription, the realistic free routes are (a) the open-source agent frameworks — Cline, opencode, and LangChain's open deepagents — which you can point at a free or low-cost model, and (b) running a smaller open-weight coding model locally if you have a capable GPU. Both require setup and neither matches the polished paid tools, so we do not build the Week 10 activity around coding agents (see 10.6). They are best treated, for now, as a powerful option you should know exists rather than a baseline everyone can reach.

📚 The connection to Week 7

Everything Week 7 said about verifying AI-generated code applies with more force here. A chatbot writes a function you can read; a coding agent runs for an hour and hands you a result built from many steps, any of which could be silently wrong (10.2's compounding-error problem). The more autonomous the coding agent, the more the verification burden grows — not shrinks.
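The compounding-error point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every step succeeds with the same probability and that steps fail independently of each other; real agents are messier, and the 99% figure is a made-up input, not a measurement of any tool.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step in a run succeeds.

    Assumes identical, independent steps (a simplification for illustration).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step success rate of 99%; vary it to see how fast the
# chance of a fully clean run falls as the run gets longer.
for steps in (1, 10, 50, 100):
    clean_run = chain_success_probability(0.99, steps)
    print(f"{steps:>3} steps -> {clean_run:.1%} chance that no step went wrong")
```

Under these assumptions a longer run is less likely to be clean from end to end, which is why more autonomy means more checking rather than less.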

🖱️ Computer-Use Agents

A computer-use agent drives a graphical interface the way a person does — it looks at the screen, then clicks and types. The “looks at the screen” part is the same vision-language technology from Week 8: the agent reads screen pixels exactly as it reads a scientific figure, and it inherits Week 8's failure modes (it can describe what it sees far more reliably than it can extract precise values from it). The point of computer-use agents is that they work with any software, including the many tools that have no API.

The standard benchmark is OSWorld — 369 real tasks across desktop applications. Its trajectory is one of the steepest in AI:

When	System	OSWorld score
January 2025	OpenAI's first computer-use agent, at launch	38.1%
—	Human baseline (estimated)	~72%
Late 2025	Simular Agent S2 — first agent to cross the human baseline	72.6%
March 2026	GPT-5.4 (OSWorld-Verified)	75.0%
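May 2026 (leaderboard top, clustered)	Claude Mythos Preview	79.6%
May 2026 (leaderboard top, clustered)	GPT-5.5	78.7%
May 2026 (leaderboard top, clustered)	Gemini 3.5 Flash	78.4%
May 2026 (leaderboard top, clustered)	Claude Opus 4.7	78.0%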

By the time you read this, those May 2026 figures will already be wrong — the public OSWorld-Verified leaderboard was reshuffling week to week as this was written, with newer preview models displacing the ones named here. That is not a flaw in the table; it is the lesson. Check the live leaderboard for today's standings, and read whatever you find there through the caveat below.
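To see the trajectory rather than the snapshot, the table's own figures can be turned into a rough rate of change. This is a minimal sketch: the numbers are copied from the table, the human baseline is the approximate 72% given there, and the earlier and later rows come from different leaderboard versions (OSWorld and OSWorld-Verified), so the result is indicative only.

```python
# Figures copied from the table above; they are a dated snapshot.
LAUNCH_SCORE = 38.1        # January 2025, OpenAI's first computer-use agent
HUMAN_BASELINE = 72.0      # estimated and approximate
TOP_SCORE_MAY_2026 = 79.6  # top of the clustered May 2026 entries


def months_between(start_year: int, start_month: int,
                   end_year: int, end_month: int) -> int:
    """Whole calendar months from the start month to the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month)


elapsed_months = months_between(2025, 1, 2026, 5)
points_gained = round(TOP_SCORE_MAY_2026 - LAUNCH_SCORE, 1)
points_above_human = round(TOP_SCORE_MAY_2026 - HUMAN_BASELINE, 1)

print(f"Gain: {points_gained} points over {elapsed_months} months")
print(f"Average: {points_gained / elapsed_months:.1f} points per month")
print(f"Margin over the estimated human baseline: {points_above_human} points")
```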

⚠️ “Beats the human baseline” does not mean “does your work”

It is true and striking that computer-use agents now match or exceed the average human score on OSWorld. But read that claim through 10.2. OSWorld is 369 specific, well-defined tasks — a narrow slice of what “using a computer” means. A benchmark score is a single-number summary of exactly the kind that the Princeton reliability work warns hides operational flaws. In real, open-ended workflows, computer-use agents remain fragile: they misclick, they lose the thread on long tasks, and — per 10.2 — their outcome consistency is far below 100%. The honest summary is the one from 10.1: the bottleneck has moved from speed to judgement, not disappeared.
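The gap between accuracy and consistency is easy to see on toy data. The sketch below uses invented results (four hypothetical tasks, four attempts each) and two simple measures chosen for illustration: the share of all attempts that passed, and the share of tasks that passed on every attempt. Neither is claimed to be the exact metric used by OSWorld or by the reliability work cited in 10.2.

```python
from typing import Dict, List


def average_accuracy(runs: Dict[str, List[bool]]) -> float:
    """Share of all attempts that passed, pooled across tasks."""
    attempts = sum(len(results) for results in runs.values())
    if attempts == 0:
        raise ValueError("no attempts recorded")
    passes = sum(sum(results) for results in runs.values())
    return passes / attempts


def consistent_success_rate(runs: Dict[str, List[bool]]) -> float:
    """Share of tasks that passed on every single attempt."""
    if not runs:
        raise ValueError("no tasks recorded")
    return sum(all(results) for results in runs.values()) / len(runs)


# Invented results for illustration only: True means the attempt succeeded.
hypothetical_runs = {
    "rename_files": [True, True, True, True],
    "fill_web_form": [True, False, True, True],
    "export_spreadsheet": [True, True, False, True],
    "edit_slide": [False, True, True, True],
}

print(f"Average accuracy:        {average_accuracy(hypothetical_runs):.0%}")
print(f"Consistent success rate: {consistent_success_rate(hypothetical_runs):.0%}")
```

In this toy data most attempts pass, yet only one task in four passes every time. A single headline score would report the first number and hide the second.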

🌐 Agentic Browsers

A growing category wraps a computer-use agent inside a web browser, so it can browse, fill forms, and carry out multi-step web tasks on your behalf. For UCT students, one of these is genuinely accessible:

Comet (Perplexity) — free, worldwide

Perplexity's agentic browser. After a paywalled 2025 launch, Comet became free for all Perplexity account holders in 2026, across iOS, Android, Mac and Windows, with no regional restriction — which means it works from South Africa without a subscription or a foreign card. The free tier includes the assistant that answers questions and summarises pages; paid plans add a background assistant for longer multi-step tasks.

Dia & ChatGPT Atlas

Dia (from the team behind the Arc browser) and ChatGPT Atlas (OpenAI's browser built around ChatGPT) are two other entrants. Availability and free tiers shift constantly; verify current access before relying on either. The category as a whole is young and the products are not yet stable.

⚠️ Agentic browsers are a prompt-injection magnet

This is 10.2's structural security hole at its most exposed. An agentic browser reads whatever web page you send it to — and any page can contain hidden instructions aimed at your agent. Never give a browsing agent the ability to act on accounts that matter (email, banking, institutional systems) while it roams the open web. Keep the permissions dial low: let it read and summarise, not send, buy, or delete.
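One way to picture a low permissions dial is as an allowlist that every requested action must pass before it runs. The sketch below is hypothetical throughout: the action names, the `LOW_DIAL` set, and the `perform` function are invented for illustration and do not correspond to the settings of Comet or any other real browser.

```python
from enum import Enum


class Action(Enum):
    READ = "read"
    SUMMARISE = "summarise"
    SEND = "send"
    BUY = "buy"
    DELETE = "delete"


# Hypothetical low setting of the dial: the agent may look, but not act.
LOW_DIAL = frozenset({Action.READ, Action.SUMMARISE})


def is_permitted(action: Action, allowed: frozenset = LOW_DIAL) -> bool:
    """True only if the action is on the allowlist."""
    return action in allowed


def perform(action: Action, target: str, allowed: frozenset = LOW_DIAL) -> str:
    """Refuse anything outside the allowlist, whatever the page asked for."""
    if not is_permitted(action, allowed):
        raise PermissionError(f"{action.value} on {target} is blocked")
    return f"{action.value} allowed on {target}"


for requested in (Action.SUMMARISE, Action.SEND):
    try:
        print(perform(requested, "example page"))
    except PermissionError as blocked:
        print(f"Refused: {blocked}")
```

The design choice worth noticing is that the check sits outside the model: a hidden instruction on a page can change what the agent asks to do, but not what the allowlist lets through.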

🤔 Manus and the “General Agent” Claim

Manus is worth discussing precisely because it is a good case study in reading agent hype critically. Launched in invitation-only beta in March 2025 by Butterfly Effect (a Chinese-founded company that relocated to Singapore), it was marketed as a general autonomous agent — give it a goal, walk away, come back to a finished task. The launch demos generated enormous attention and a thriving secondary market in invitation codes.

Apply the week's discipline. Manus's headline capability claims are company-reported, and independent replication has been scarce — exactly the situation 10.1 and 10.2 tell you to treat cautiously, because a polished demo is a best case, not a reliability measurement. What is independently documented is the regulatory drama: in April 2026, China's National Development and Reform Commission blocked Meta's reported $2 billion acquisition of Manus, following a Ministry of Commerce investigation earlier in the year. That is a verifiable fact about the company; the “it can do anything” framing is not. Useful tool, genuine engineering — but a textbook example of why you separate what a vendor claims from what has been measured.

🔌 MCP: The Connector Layer

The Model Context Protocol (MCP) is the plumbing that lets an agent connect to your tools and data through a single standard interface — often described as “USB-C for AI”. Instead of every tool needing a bespoke integration with every model, a tool exposes an MCP server once and any MCP-speaking agent can use it.
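The appeal of a shared interface is mostly a counting argument. The sketch below is a simplification that treats every model-to-tool pairing as one bespoke integration and every protocol implementation as one unit of work; the counts of three models and ten tools are made up for illustration.

```python
def bespoke_integrations(models: int, tools: int) -> int:
    """Every model needs its own integration with every tool."""
    return models * tools


def shared_protocol_integrations(models: int, tools: int) -> int:
    """Each model and each tool implements the shared protocol once."""
    return models + tools


# Hypothetical ecosystem size, chosen only to show the difference in growth.
models, tools = 3, 10
print(f"Bespoke:         {bespoke_integrations(models, tools)} integrations")
print(f"Shared protocol: {shared_protocol_integrations(models, tools)} implementations")
```

With a shared protocol, adding one more tool costs one implementation rather than one per model, which is the sense in which the “USB-C” comparison is usually meant.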

MCP matters in 2026 because it stopped being one vendor's idea and became a cross-industry standard with remarkable speed. Anthropic introduced it in November 2024; OpenAI adopted it in March 2025; Google followed in April 2025. In December 2025 Anthropic donated the protocol to a Linux Foundation body (the Agentic AI Foundation, co-founded with Block and OpenAI), so it is now governed in the open rather than controlled by one company. In barely a year it went from a single company's proposal to a standard that all the major model providers speak — which is why it is worth a researcher's attention rather than just a developer's.

🔬 Why a researcher should care

MCP is what will connect agents to the tools you already use. MCP servers exist (or are emerging) for reference managers like Zotero, for GitHub, Google Drive, Notion, and more — meaning an agent can search your library, read your repository, or draft in your notes through one standard. On Claude.ai, a set of connectors is available even on the free plan (the nine creative connectors from 10.1 are MCP under the hood). We return to the specific research-relevant connectors, and how to use them safely, in 10.5.

The protocol's own March 2026 roadmap signals where it is heading: scaling the transport layer for production, refining agent-to-agent communication, maturing open governance, and enterprise features like audit trails and authentication — an evolution “from releases to working groups” that marks MCP's shift from experiment to infrastructure.

⚠️ The pragmatic and the security caveats

Two cautions. First, practically: MCP is not always the right tool. Simon Willison — among the most careful practitioner voices — has argued that for a lot of agent work, plain command-line utilities are simpler and more reliable than wiring up an MCP server. MCP is a standard, not a magic wand. Second, and more seriously: anything an agent reads through MCP is untrusted content, so MCP is another channel for the prompt injection from 10.2. A connector that can both read your email and browse the web is exactly the kind of capability combination that turns a hidden instruction on a web page into an action on your inbox. Connect deliberately; keep the permissions dial low.
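The capability-combination warning can be written down as a simple check over whatever connectors are switched on. This is a sketch with invented names: the `Connector` fields and the flags given to each example are assumptions for illustration, not properties read from any real MCP server.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Connector:
    name: str
    reads_untrusted_content: bool   # can bring in text written by strangers
    acts_on_private_account: bool   # can read or change something that matters


def is_risky_combination(connectors: Iterable[Connector]) -> bool:
    """Flag a set that mixes untrusted input with access to a private account."""
    enabled = list(connectors)
    exposed = any(c.reads_untrusted_content for c in enabled)
    privileged = any(c.acts_on_private_account for c in enabled)
    return exposed and privileged


# Hypothetical connectors. Simplification: incoming mail can itself carry
# injected text, so a stricter model would also mark email as untrusted.
web = Connector("web_browsing", reads_untrusted_content=True,
                acts_on_private_account=False)
email = Connector("email", reads_untrusted_content=False,
                  acts_on_private_account=True)

print("web only:     ", is_risky_combination([web]))
print("web and email:", is_risky_combination([web, email]))
```

The check looks at the whole set rather than at each connector alone, because the danger described above comes from the combination.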

🧭 Reading the Frontier: Real, Overclaimed, and Vapour

A closing honesty note. In the spirit of Week 9, though, the honest move is calibration in both directions, not blanket scepticism. Some frontier agent claims are genuine and shipping; some are real but overclaimed; some are still vapour. The skill is telling them apart. Here are three examples that are easy to get wrong in either direction.

🧬 Autonomous science: real milestones, overclaimed framing

It would be wrong to dismiss autonomous-science agents as vapour. The milestones are real and documented. Sakana AI's AI Scientist-v2 produced a fully AI-generated manuscript that cleared peer review at an ICLR 2025 workshop — the first time a start-to-finish AI-written paper passed human review. DeepMind's AlphaEvolve discovered a way to multiply two 4×4 complex matrices using 48 scalar multiplications, improving on a record that had stood since 1969. Hypothesis-generation agents like DeepMind's Co-Scientist and FutureHouse's Robin are being used in real labs. These sit alongside the Week 9 results — Erdős problems, gluon amplitudes — as genuine AI contributions to research.

What is overclaimed is the framing “the AI did the science”. The workshop paper passed a deliberately lower bar than a main conference and was withdrawn by agreement; the system still makes errors a reviewer must catch. Most of these tools generate hypotheses or optimise within a well-defined space — the human still frames the problem, validates the result, and supplies the novel data the agent cannot gather itself. As one assessment put it, we are likely years from autonomous “true innovation”. So: real and accelerating, genuinely useful — and not the end-to-end autonomous discovery the headlines imply. Both halves are true.

🔄 Self-improving agents: bootstrapping is real; recursive self-improvement is not (yet)

Here too the honest answer is “partly real”. AI coding tools are now substantially used to build themselves: Anthropic engineers write much of Claude Code using Claude Code, and the Meta-Harness research from 10.1 automates the search for better harnesses. That is genuine bootstrapping — the tools are accelerating their own development, and it is not hype to say so.

But it is human-directed bootstrapping, not the autonomous recursive self-improvement the term usually invokes. An engineer uses Claude Code to write Claude Code faster, and that engineer still reviews the output and decides what ships; the agent is not, on its own, rewriting itself in a closed loop and getting recursively more capable without people. “AI-accelerated development of AI tools” (real, shipping, important) is a different claim from “agents autonomously improving themselves” (not what is happening). Keep the two apart and you will read this corner of the hype accurately.

And some things genuinely are just vapour or rumour as of mid-May 2026 — for example the Anthropic general-work product circulating under the codename “Orbit” (from 10.1), which had not officially launched. A codename is not a product, and a demo is not a deployment. The pattern across all three examples is the same: find the verified core, name the overclaim around it, and do not let either the hype or the backlash do your thinking for you.

How to read this whole landscape

Three questions cut through almost any agent-tool announcement. Which harness? (10.1: the model name alone tells you little.) How reliable, not just how accurate? (10.2: a demo is a best case.) What can I actually access for free, from here? (This week's throughline.) Hold those three up against every tool above and in 10.5, and you will not be badly misled.

📖 Sources & Further Reading

OSWorld-Verified leaderboard and XLANG Lab, “Introducing OSWorld-Verified” — the computer-use benchmark, human baseline, and current standings.
Model Context Protocol (Wikipedia) — the cross-vendor adoption timeline and Linux Foundation governance.
MCP 2026 Roadmap (9 March 2026) — the protocol's own stated priorities.
Simon Willison on MCP — the pragmatic critique that MCP is not always the best abstraction.
Perplexity Comet · CNBC on Comet going free — the free agentic browser accessible from South Africa.
Manus (Wikipedia) — origin and the blocked Meta acquisition; note its capability claims are company-reported.
Sakana AI, The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066 — the autonomous-paper-through-peer-review milestone, with its own caveats.
Google DeepMind, “AlphaEvolve” — the 4×4 complex-matrix algorithm (48 multiplications) that improved on Strassen's 1969 record after 56 years.
C&EN (May 2026), “AI companies introduce new agent-based tools for scientific discovery” — overview of Co-Scientist, Robin, and the realistic limits.

👉 What Comes Next

Sub-Lesson 10.4 — RAG in 2026. We deferred the research-agent family to its own lesson, and here it is. Retrieval-augmented generation was the backbone of the literature tools in Week 5; we look at what has changed since — how much longer context windows have eroded the need for naive retrieval, where “agentic RAG” wins it back, and why evaluating any of this remains genuinely hard. Deep Research modes — the research agents — sit right at the centre of that story.